Abstract

The GOR program for predicting protein secondary structure is extended to include triple
correlations. A score system for a residue pair to be in a certain conformation state is derived
from the conditional weight matrix describing amino acid frequencies at each position of a window
flanking the pair, under the condition that the pair is in the given state. A program using this
score system to predict protein secondary structure is established. After training the model with
a learning set created from PDB_SELECT, the program is tested with two test sets. As a method
using a single sequence for predicting secondary structure, the approach achieves a high accuracy
near 70%.

PACS number(s): 87.10.+e,02.50.-r 

1 Introduction 

Methods for predicting the secondary structure of a protein from its amino acid sequence have been
developed for three decades. Besides neural network models and nearest-neighbor methods, the
statistically based Chou-Fasman/GOR method is well established and commonly used. In 1974, assuming
an oversimplified independence to cope with the large size (20) of the amino acid alphabet given a
small database, Chou and Fasman (1974) derived a table of propensities for a particular residue to
be in a given secondary structure state. Combined with a set of rules, this propensity was used to
predict protein secondary structure. Later, in the first version of the GOR program (Garnier,
Osguthorpe, and Robson, 1978), the state of a single residue $a_i$ was predicted according to a
window from $i-8$ to $i+8$ surrounding the residue. Unlike Chou-Fasman, which assumes that each
amino acid individually influences its own secondary structure state, GOR takes into account the
influence of the amino acids flanking the central residue on the central residue state by deriving
an information score from the weight matrix describing 17 individual amino acid frequencies at
sites $i+k$ with $-8 \le k \le +8$. Because a single weight matrix was used, the correlation among
amino acids within the window was still ignored. In the later version GOR III (Gibrat, Garnier, and
Robson, 1987), instead of a single weight matrix for every structure state, 20 weight matrices,
each corresponding to a specific type of central residue, were used. These conditional weight
matrices take the pair correlation between the central residue and a flanking one into account. In
the most recent version of GOR (GOR IV; Garnier, Gibrat, and Robson, 1996), all pairwise
combinations of amino acids in the flanking region were included.
The GOR program maps a local window of residues in the sequence to the structural state of the
central residue in the window. Correlations among positions within the window are essential for
improvements in prediction accuracy. We give an example to show the importance of high-order
correlations. For the central residue $a_i = K$ in the extended strand state, the conditional
probability for $a_{i-3} = V$ is $P_{-3,0}(V|K) = 0.088$, while the conditional probability for
$a_{i-3} = V$ given $a_i = K$ and $a_{i+1} = E$ is $P_{-3,0,+1}(V|KE) = 0.186$, and that given
$a_i = K$ and $a_{i+1} = V$ is $P_{-3,0,+1}(V|KV) = 0.058$. With the growth of the protein
structure database, the number of known protein structures now allows us to consider correlations
higher than pair ones. Here we extend the GOR program to include triple correlations, developing a
program to predict protein secondary structure based on residue pairs (PSSRP). As a method using a
single sequence, it requires rather light computation, yet its prediction accuracy reaches 70%.



2 Methods 

Kabsch and Sander (1983) define eight states of secondary structure according to the hydrogen-bond
pattern. As in most methods, we consider three states $\{h, e, c\}$ generated from the eight by the
coarse-graining $H, G, I \to h$, $E \to e$ and $X, T, S, B \to c$.
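The coarse-graining above is a simple table lookup. The following is a minimal sketch; the function name `coarse_grain` is illustrative, and treating any code outside the table (for example a blank or `-`) as coil is an assumption, not something the text specifies.

```python
# Three-state reduction of DSSP codes as described in the text:
# H, G, I -> h;  E -> e;  X, T, S, B -> c.
DSSP_TO_3STATE = {
    "H": "h", "G": "h", "I": "h",
    "E": "e",
    "X": "c", "T": "c", "S": "c", "B": "c",
}


def coarse_grain(dssp: str) -> str:
    """Map a string of DSSP codes to the 3-state alphabet {h, e, c}."""
    # Assumption: codes not listed in the table (blank, '-') count as coil.
    return "".join(DSSP_TO_3STATE.get(code, "c") for code in dssp)
```

For instance, `coarse_grain("HGIEXTSB")` gives one reduced state per input code.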

2.1 Window-based scores 

A window is a sequence segment $a_i a_{i+1} \ldots a_{i+l-1}$ of width $l$. Consider two residues
$a_{i+j} = x$ and $a_{i+k} = y$ inside the window (with $0 \le j < k \le l-1$). Their conformation
states are $\alpha$ and $\beta$, respectively. Let us denote by $w$ the set of sites within the
window with $i+j$ and $i+k$ excluded. When discussing probability, we ignore the starting site
index $i$.

The Chou-Fasman propensity of residue $x$ to conformation $\alpha$ is defined by

$$CF(x; \alpha) = \frac{P(x|\alpha)}{P(x)}, \qquad (1)$$

where $P(x)$ is the probability for residue $x$ to appear, and $P(x|\alpha)$ the conditional
probability for $x$ to appear at conformation $\alpha$. Here we use only a logarithmic propensity,
the logarithm of $CF$:

$$LCF(x; \alpha) = \log CF(x; \alpha). \qquad (2)$$
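As an illustration of the logarithmic propensity, it can be estimated from raw counts over aligned residue and state strings. This is a sketch under assumptions: the function name `log_cf` is hypothetical, plain frequency ratios are used without pseudocounts, and the natural logarithm is assumed since the text does not state the base.

```python
import math
from collections import Counter


def log_cf(sequences, states):
    """Return {(x, a): LCF(x; a)} estimated from plain frequency ratios.

    sequences : list of amino acid strings
    states    : list of 3-state strings, aligned with `sequences`
    """
    n_x, n_a, n_xa = Counter(), Counter(), Counter()
    for seq, sst in zip(sequences, states):
        for x, a in zip(seq, sst):
            n_x[x] += 1
            n_a[a] += 1
            n_xa[(x, a)] += 1
    total = sum(n_x.values())
    lcf = {}
    for (x, a), n in n_xa.items():
        p_x_given_a = n / n_a[a]      # P(x | a)
        p_x = n_x[x] / total          # P(x)
        lcf[(x, a)] = math.log(p_x_given_a / p_x)
    return lcf
```

Pairs $(x, \alpha)$ that never occur are simply absent from the result; a practical estimate would need some smoothing for them.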

As an extension of the Chou-Fasman propensity of residues, the propensity of residue pair $xy$ to
conformation $\alpha\beta$ may be defined as

$$r_d(xy; \alpha\beta) = \frac{P_d(xy|\alpha\beta)}{P_d(xy)}, \qquad d = k-j-1, \qquad (3)$$

where $P_d(xy)$ is the probability for residue pair $x$ and $y$ to appear with separation $d$, and
$P_d(xy|\alpha\beta)$ the conditional probability under the condition that the conformation states
of $x$ and $y$ are $\alpha$ and $\beta$, respectively.

When a window flanking the pair $xy$ is examined to infer the conformation $\alpha\beta$ of $xy$, a
further extension of the Chou-Fasman propensity is

$$R_d(xy; \alpha\beta) = \frac{P_d(xy, w|\alpha\beta)}{P_d(xy, w)}, \qquad d = k-j-1, \qquad (4)$$

where $P_d(xy, w)$ and $P_d(xy, w|\alpha\beta)$ are now probabilities for the whole window, i.e.
$xy$ and $w$. Making the assumption of independence, we introduce the conditional weight matrix
$Q^{xy}$ of $(l-2)$ columns and 20 rows, whose entries describe the probability for a specific
residue $z$ to appear at some flanking site, say the $n$-th site from the window starting position.
We denote the probability by $Q_{d,n}(z|xy)$, and write

$$P_d(xy, w) = P_d(xy)\, P_d(w|xy) = P_d(xy) \prod_{n \in w} Q_{d,n}(a_{i+n}|xy). \qquad (5)$$

A similar simplification for $P_d(xy, w|\alpha\beta)$ is

$$P_d(xy, w|\alpha\beta) = P_d(xy|\alpha\beta)\, P_d(w|xy, \alpha\beta) = P_d(xy|\alpha\beta) \prod_{n \in w} Q_{d,n}(a_{i+n}|xy, \alpha\beta), \qquad (6)$$

where the meaning of $Q_{d,n}(z|xy, \alpha\beta)$ is analogous to $Q_{d,n}(z|xy)$. The window score
$I_d(xy; \alpha\beta)$ for pair $xy$ to be at conformation $\alpha\beta$ is then defined as the
logarithm of $R_d$:

$$I_d(xy; \alpha\beta) = \log \frac{P_d(xy|\alpha\beta)}{P_d(xy)} + \sum_{n \in w} \log \frac{Q_{d,n}(a_{i+n}|xy, \alpha\beta)}{Q_{d,n}(a_{i+n}|xy)}. \qquad (7)$$
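The window score is a pair term plus a sum of per-site log ratios over the flanking sites. A minimal sketch of that sum follows; the function name `window_score` and the nested-dictionary layout of the weight matrices are assumptions made for illustration, and the matrices are taken to be already estimated for one fixed combination of $d$, $xy$ and $\alpha\beta$.

```python
import math


def window_score(window, j, k, log_pair_ratio, q_cond, q_pair):
    """Score I_d(xy; ab) of one window for a fixed (d, xy, ab).

    window         : residue string of length l
    j, k           : positions of x and y inside the window (j < k)
    log_pair_ratio : log[P_d(xy|ab) / P_d(xy)]
    q_cond[n][z]   : Q_{d,n}(z | xy, ab)
    q_pair[n][z]   : Q_{d,n}(z | xy)
    """
    score = log_pair_ratio
    for n, z in enumerate(window):
        if n in (j, k):
            # The pair itself is not a flanking site; it enters only
            # through log_pair_ratio.
            continue
        score += math.log(q_cond[n][z] / q_pair[n][z])
    return score
```

Because the flanking sites are treated as independent, each site contributes additively, so a site whose conditional and unconditional frequencies agree contributes nothing.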



So far we have not determined the window width $l$ and the position of the inferred residue pair
inside the window, especially the separation $d$ of the two residues. To do this, we need a measure
of the distance between two probability distributions. A well-defined measure is the
Kullback-Leibler (KL) distance or relative entropy (Kullback et al., 1959; Kullback, 1987; Sakamoto
et al., 1986), which, for two distributions $\{p_i\}$ and $\{q_i\}$, is given by

$$KL(\{p_i\}, \{q_i\}) = \sum_i p_i \log(p_i/q_i). \qquad (8)$$

It corresponds to a likelihood ratio, and, if $p_i$ is expanded around $q_i$, its leading term is
the $\chi^2$ distance. The following symmetrized form is often used:

$$D(\{p_i\}, \{q_i\}) = \frac{1}{2}\left[KL(\{p_i\}, \{q_i\}) + KL(\{q_i\}, \{p_i\})\right]. \qquad (9)$$
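Equations (8) and (9) translate directly into code. The sketch below assumes both distributions are given as equal-length sequences of probabilities and that neither is zero where the other is positive; the function names are illustrative.

```python
import math


def kl(p, q):
    """Kullback-Leibler distance KL(p, q) = sum_i p_i log(p_i / q_i)."""
    # Terms with p_i == 0 contribute zero by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sym_kl(p, q):
    """Symmetrized form D(p, q) = [KL(p, q) + KL(q, p)] / 2."""
    return 0.5 * (kl(p, q) + kl(q, p))
```

Unlike `kl`, the symmetrized `sym_kl` does not depend on the order of its arguments, which is convenient when comparing two weight-matrix columns.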

The distance $D_n(xy, \alpha\beta) = D[\{Q_{d,n}(z|xy, \alpha\beta)\}_z, \{Q_{d,n}(z|xy)\}_z]$
measures the power of site $i+n$ to infer conformation $\alpha\beta$. Asymptotically
$D_n(xy, \alpha\beta)$ approaches zero as $n$ moves far away from the sites of pair $xy$. The power
of a window to infer conformation $\alpha\beta$ of $xy$ may be measured by
$\langle \sum_{n \in w} D_n(xy, \alpha\beta) \rangle$, the window sum distance averaged with the
weight $P(xy, \alpha\beta)$. We find that a reasonable choice is $l = 16$ and $d = 0, 1$ with
residue $y$ at the 9th site of the window. A detailed discussion will be published elsewhere.
Due to limited samples, we consider only $\alpha\beta \in \{cc, ee, hh, ce, ch, ec, hc\}$, with
$eh$ and $he$ excluded.

2.2 Prediction steps 

Using scores $I_d(xy; \alpha\beta)$ and sliding windows, we may calculate scores of true and false
windows for each combination of $xy$, $\alpha\beta$ and $d$. The threshold $T_d(xy; \alpha\beta)$
is determined by the error rate of 5% at which non-$\alpha\beta$ conformations are wrongly
predicted as $\alpha\beta$. For a given window, if its score $I_d(xy; \alpha\beta)$ is greater than
the corresponding threshold $T_d(xy; \alpha\beta)$, we say that '$x$ at $\alpha$', '$y$ at $\beta$'
and '$xy$ at $\alpha\beta$' are evidenced by the event $(d; xy; \alpha\beta)$.
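One way to read the 5% rule is as an empirical quantile of the scores of false windows. The sketch below follows that reading; it is an assumption about the procedure, not a description of the authors' program, and the function names are hypothetical.

```python
def threshold_at_error_rate(false_scores, rate=0.05):
    """Return a threshold T such that at most `rate` of the false windows
    score strictly above T.

    false_scores : non-empty list of scores of windows whose true
                   conformation is not the one being tested
    """
    ranked = sorted(false_scores, reverse=True)
    allowed = int(rate * len(ranked))   # false windows tolerated above T
    return ranked[allowed]


def is_evidenced(score, threshold):
    """An event is evidenced when the window score exceeds the threshold."""
    return score > threshold
```

With 100 false-window scores and the default rate, the threshold is the sixth-largest score, so exactly the five largest false windows would be wrongly accepted.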

Step 1: If '$xy$ at $\alpha\beta$' is evidenced by event $(d; xy; \alpha\beta)$, and, at the same
time, both '$x$ at $\alpha$' and '$y$ at $\beta$' are further evidenced by some other events, we
say that '$xy$ at $\alpha\beta$' is strongly confirmed. Determine all strongly confirmed pairs.
Step 2: We now consider the case that '$xy$ at $\alpha\beta$' is evidenced, but not strongly
confirmed. If either '$zx$ at $\gamma\alpha$' or '$yz$ at $\beta\gamma$' is strongly confirmed, we
say that '$xy$ at $\alpha\beta$' is weakly confirmed. Determine all the weakly confirmed pairs.
Step 3: From a confirmed pair, either strongly or weakly, we calculate the score $I(x; \alpha)$ for
single residue $x$ to be at conformation $\alpha$ according to

$$I(x; \alpha) = \max \{ I_d(xy; \alpha\beta) - LCF(y; \beta),\ I_{d'}(zx; \gamma\alpha) - LCF(z; \gamma) \}, \qquad (10)$$

where only confirmed pairs are searched for the maximum. The conformation of $x$ is finally
inferred as

$$\alpha^* = \arg\max_{\alpha} \{ I(x; \alpha) \}, \qquad (11)$$

i.e. $\alpha^*$ is the $\alpha$ corresponding to the maximal $I(x; \alpha)$. Infer residue
conformation for all the confirmed pairs.
Step 4: In this step we expand already inferred $h$ and $e$ segments in both directions. Suppose
residue $x$ is a candidate for elongating $e$. We calculate the score $I(x; e)$ according to (10),
but now only pairs consistent with '$x$ at $e$' are searched for the maximum. If $I(x; e)$ is
positive, we assign conformation $e$ to $x$. The elongation of $h$ is similar.
Step 5: At both ends of the sequence no full windows are available. The first and last two residues
are always assigned to $c$. With the contribution of the missing sites set to zero, scores
$I(x; \alpha)$ of residues $x$ in the end regions are calculated for the elongation of the already
determined boundary conformation $\alpha$ in a way similar to the last step. For the remaining
residues we examine only the cases of four successive residues, say
$a_i a_{i+1} a_{i+2} a_{i+3}$, in the same conformation $\alpha$ by calculating
$I_0(a_i a_{i+1}; \alpha\alpha) + I_0(a_{i+2} a_{i+3}; \alpha\alpha)$, and then infer the
conformation according to the largest positive score.
Step 6: This final step is filtering. Each single-residue $h$ or $e$ segment is discarded. A
conformation segment $hh$ is expanded to $hhh$ according to whichever neighbor site has the larger
score for $h$. Each residue whose conformation has not been predicted so far is assigned to
conformation $c$.
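Part of the filtering in Step 6 can be sketched on a prediction string. The version below covers only two of the three rules, discarding single-residue $h$ and $e$ segments and assigning coil to unassigned residues; the expansion of $hh$ to $hhh$ is left out because it needs the residue scores. The `-` marker for an unassigned residue and the function name are assumptions.

```python
import re


def filter_prediction(pred: str) -> str:
    """Apply part of the Step 6 filtering to a string over {h, e, c, -}.

    '-' marks a residue whose conformation has not been predicted.
    """
    # Discard h segments and e segments that consist of a single residue.
    pred = re.sub(r"(?<!h)h(?!h)", "-", pred)
    pred = re.sub(r"(?<!e)e(?!e)", "-", pred)
    # Every residue still unassigned becomes coil.
    return pred.replace("-", "c")
```

The lookbehind and lookahead match an `h` (or `e`) only when neither neighbor carries the same state, which is exactly a segment of length one.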

Let us explain the prediction steps in more detail with a simple example. Suppose that a total of
five events are found for the segment PDEFGHI of a sequence, as shown below.

...ACLMNPDEFGHIKQRST...
        --e.c--
        e.e----
        ----c.h
        ----ch-
        ----cc-

In Step 1, EG at ec is strongly confirmed by the first three events. This is the only strongly
confirmed pair in the segment. Based on this pair, in Step 2 we find that all the remaining four
events are weakly confirmed. In Step 3, we can easily infer 'P at e', 'E at e', 'G at c', and
'I at h'. Further calculation of $I(H; h)$ and $I(H; c)$ leads to 'H at h'. The conformation of
PDEFGHI is now inferred as e-e-chh. Step 4 then fills the gaps to get eeeechh.
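The bookkeeping of Steps 1 and 2 for this example can be written down explicitly. The sketch below is an illustration under assumptions: each event is encoded as a tuple (site of $x$, site of $y$, $\alpha$, $\beta$) with sites numbered 0 to 6 along PDEFGHI, the five events are those read from the example, and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Event = Tuple[int, int, str, str]   # (site of x, site of y, alpha, beta)


def strongly_confirmed(events: List[Event]) -> Set[int]:
    """Step 1: both residues of the pair are evidenced by other events."""
    strong = set()
    for n, (ix, iy, a, b) in enumerate(events):
        others = [e for m, e in enumerate(events) if m != n]
        x_ok = any((ix, a) in ((jx, ja), (jy, jb)) for jx, jy, ja, jb in others)
        y_ok = any((iy, b) in ((jx, ja), (jy, jb)) for jx, jy, ja, jb in others)
        if x_ok and y_ok:
            strong.add(n)
    return strong


def weakly_confirmed(events: List[Event], strong: Set[int]) -> Set[int]:
    """Step 2: evidenced pairs that chain onto a strongly confirmed pair."""
    weak = set()
    for n, (ix, iy, a, b) in enumerate(events):
        if n in strong:
            continue
        for m in strong:
            jx, jy, ja, jb = events[m]
            # 'zx at gamma-alpha': x is the second residue of a strong pair;
            # 'yz at beta-gamma' : y is the first residue of a strong pair.
            if (jy, jb) == (ix, a) or (jx, ja) == (iy, b):
                weak.add(n)
                break
    return weak


# Events of the example: EG at ec, PE at ee, GI at ch, GH at ch, GH at cc.
EXAMPLE = [(2, 4, "e", "c"), (0, 2, "e", "e"), (4, 6, "c", "h"),
           (4, 5, "c", "h"), (4, 5, "c", "c")]
```

Applied to `EXAMPLE`, `strongly_confirmed` selects only the first event (EG at ec), and `weakly_confirmed` then picks up the other four, in agreement with the walkthrough.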



3 Result

We create a nonredundant set of 1612 non-membrane proteins for training parameters from the
PDB_SELECT list (Hobohm and Sander, 1994) issued on 25 September 2001, with amino acid identity
less than 25%. The secondary structures for these sequences are taken from the DSSP database
(Kabsch and Sander, 1983). As mentioned above, the eight states of DSSP are coarse-grained into
three states: $h$, $e$ and $c$. This learning set contains 268031 residues with known
conformations, among which 94415 are $h$, 56510 are $e$, and 117106 are $c$. The size of the
learning set is reasonable for training our parameters.

Converting observed amino acid counts into frequencies or probabilities for scoring is a basic
problem faced in training. A practical approach is to use pseudocounts (Aitchison and Dunsmore,
1972). We estimate the background amino acid frequencies $\{p_x\}$ directly from counts over the
whole learning set. Then we estimate the weight matrix element $Q_{d,n}(z|xy, \alpha\beta)$ from
the count $N_{d,n}(z|xy, \alpha\beta)$ of amino acid $z$ at position $n$, under the condition that
the residue pair $xy$ with separation $d$ is in conformation $\alpha\beta$, as follows.

(12)

where $Q_{n,z}$, indicating specifically only $n$ and $z = a_{i+n}$, stands for
$Q_{d,n}(a_{i+n}|xy, \alpha\beta)$, and $N_{n,z}$ stands for $N_{d,n}(a_{i+n}|xy, \alpha\beta)$.
Here, the conditional probability $Q_{n,z}$ is estimated by using a pseudocount proportional to the
background distribution $\{p_x\}$. If $N_{d,n}(z|xy, \alpha\beta)$ is less than 10, the sample size
is too small to estimate the score reliably. No such scores are used for inference; equivalently,
they are set to negative infinity.
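A pseudocount estimate proportional to the background can be sketched as follows. The functional form $(N_{n,z} + A\,p_z)/(\sum_{z'} N_{n,z'} + A)$ and the weight $A$ (`weight` below) are assumptions chosen for illustration, since the exact expression is not recoverable from the text; only the cutoff of 10 counts is taken from the text.

```python
def estimate_q(counts, background, weight=1.0, min_count=10):
    """Estimate one column Q_{n,z} of a conditional weight matrix.

    counts     : {z: N_{n,z}} observed counts at window position n
    background : {z: p_z} background frequencies, summing to 1
    weight     : total pseudocount mass A (hypothetical parameter)
    Returns (q, usable), where usable[z] is False when the count of z is
    below `min_count` and the score should not be used for inference.
    """
    total = sum(counts.values())
    q = {z: (counts.get(z, 0) + weight * p) / (total + weight)
         for z, p in background.items()}
    usable = {z: counts.get(z, 0) >= min_count for z in background}
    return q, usable
```

Because the pseudocounts sum to `weight`, the estimated column still sums to one, and residues never observed at a position receive a small nonzero probability instead of zero.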

To assess the accuracy of our approach, we use two test sets, Sets 1 and 2. A set of 124
nonhomologous proteins is created from the representative database of Rost and Sander (1993) by
removing subunits A and B of hemagglutinin 3hmg, which are designated as membrane proteins by SCOP
(Murzin et al., 1995). The 124 sequences and the learning set are not independent of each other
according to the HSSP database (Dodge, Schneider and Sander, 1998); that is, some of the 124
proteins and certain proteins in the learning set belong to the same putative homologue family of
HSSP. Removing these proteins from the 124 sequences, together with 5 sequences containing unknown
amino acid segments longer than 6, we construct Set 1 of 76 proteins, a subset of the 124
sequences. Set 2 consists of 34 proteins with known structures from the CASP4 database issued in
December 2000.

The predicted counts of each conformation type in the test sets are listed in Table 1. The
quantities assessing accuracy at the single-residue level for a given test set are the total
percent correct $Q_3$, the percent of types $h$ and $e$ predicted correctly (sensitivity $S_n$) and
the percent of predictions correct for types $h$ and $e$ (specificity $S_p$). Results obtained on
the test sets are listed in Table 2 in comparison with the results of GOR IV and SSP (Solovyev and
Salamov, 1991, 1994), another secondary structure predictor based on discriminant analysis using a
single sequence. PSSRP performs well: the overall value of $Q_3$ averaged over Sets 1 and 2 is
70.2%. We show $Q_3$ statistics for Sets 1 and 2 in Fig. 1.
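The three single-residue quantities can be computed from a pair of aligned state strings. The following is a small sketch with an illustrative function name; it returns fractions rather than percentages and uses `None` where a ratio is undefined.

```python
def residue_accuracy(observed: str, predicted: str):
    """Return (Q3, {state: (Sn, Sp)}) for states h and e.

    Q3 : fraction of residues whose state is predicted correctly
    Sn : fraction of observed residues of a state predicted as that state
    Sp : fraction of predictions of a state that are correct
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    pairs = list(zip(observed, predicted))
    q3 = sum(o == p for o, p in pairs) / len(pairs)
    stats = {}
    for s in "he":
        tp = sum(o == s and p == s for o, p in pairs)
        n_obs, n_pred = observed.count(s), predicted.count(s)
        stats[s] = (tp / n_obs if n_obs else None,
                    tp / n_pred if n_pred else None)
    return q3, stats
```

Note that the paper's "specificity" is what is elsewhere often called precision: it is normalized by the number of predictions, not by the number of true negatives.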

Generally, strongly confirmed pairs are the portion predicted with high confidence. Strongly
confirmed pairs cover 58.4% and 53.8% of Sets 1 and 2, respectively, and the accuracy $Q_3$ within
the covered portion is 79.5% and 77.4%, respectively. There is a clear correlation between the
coverage rate $C$, i.e. the percentage of the total length of a sequence covered by strongly
confirmed pairs, and the whole-sequence accuracy $Q_3$. To examine this correlation, we conduct a
simple linear regression of $Q_3$ on $C$ for Sets 1 and 2, as shown in Fig. 2, with regression line
$Q_3 = 0.728C + 0.306$. The correlation coefficient $r$ and standard deviation $\sigma$ are 0.779
and 0.0276 for Set 1, 0.830 and 0.0209 for Set 2, and 0.795 and 0.0253 for the two sets combined.
Thus, the coverage rate of the strongly confirmed pairs provides a self-checking confidence level
for the prediction accuracy.
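The regression line can be used as a rough per-sequence accuracy estimate from the coverage rate alone. The sketch below evaluates the reported line and also shows an ordinary least-squares fit for readers who want to refit it on their own data; the function names are illustrative, and the fit is the textbook formula rather than anything specific to this work.

```python
def expected_q3(coverage, slope=0.728, intercept=0.306):
    """Evaluate the reported regression line Q3 = 0.728 C + 0.306.

    coverage : fraction (0 to 1) of the sequence covered by strongly
               confirmed pairs
    """
    return slope * coverage + intercept


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx
```

For a sequence with half of its length covered by strongly confirmed pairs, `expected_q3(0.5)` evaluates the line at $C = 0.5$.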

The $Q_3$ measure gives the overall fraction of residues predicted correctly. It is well known that
single-residue accuracy sometimes poorly reflects the quality of a prediction. Measures
concentrating on the accuracy of secondary structure segment prediction better reflect the nature
of the structure. A simple segment overlap measure is the following: a segment is considered
correctly predicted if the predicted and observed segments have at least two amino acids in common
(Taylor, 1984). Results of this segment prediction accuracy from our approach are listed in
Table 3.
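The segment overlap rule can be sketched as below. Two points are assumptions rather than statements from the text: sensitivity is taken as the fraction of observed segments matched by some predicted segment, and specificity as the fraction of predicted segments matched by some observed segment. The function names are illustrative.

```python
from itertools import groupby


def segments(states: str, kind: str):
    """Return (start, end) pairs, end exclusive, of runs of `kind`."""
    out, pos = [], 0
    for s, run in groupby(states):
        length = len(list(run))
        if s == kind:
            out.append((pos, pos + length))
        pos += length
    return out


def segment_accuracy(observed, predicted, kind, min_overlap=2):
    """Return (Sn, Sp) at segment level for one state, e.g. 'h' or 'e'."""
    obs, pred = segments(observed, kind), segments(predicted, kind)

    def hit(a, others):
        # Two segments match when they share at least min_overlap residues.
        return any(min(a[1], b[1]) - max(a[0], b[0]) >= min_overlap
                   for b in others)

    sn = sum(hit(o, pred) for o in obs) / len(obs) if obs else None
    sp = sum(hit(p, obs) for p in pred) / len(pred) if pred else None
    return sn, sp
```

Splitting the observed segments by length before calling `segment_accuracy` would give the separate figures for short and long helices and sheets.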

4 Discussions 

We have presented an improved approach that uses a single protein sequence to predict secondary
structure. The improvement is achieved by including triple correlations. Most recent improvements
in accuracy come from methods that are capable of considering correlations nonlocal in sequence.
Incorporating evolutionary information via multiple alignments of homologous sequences is a main
way to include such correlations. Although methods using a single sequence generally cannot cope
with nonlocal correlations easily, their simplicity, requiring the least computation in the
prediction step, is still attractive.

There is room for further improvement of PSSRP. The original scores are obtained for residue pairs
and are then converted to scores for single residues in prediction Step 3. It is possible to
construct a scoring system that uses pair scores directly by introducing appropriate weighting. We
may tune the thresholds to trade off $S_p$ against $S_n$. We may also integrate segment length
statistics into the approach, e.g. by dynamic programming.

The size of the amino acid alphabet is 20. The number of parameters increases drastically with the
order of correlations considered. A statistical model containing a tremendous number of parameters
requires a huge learning set to train them. Furthermore, an over-complicated model can easily
result in overfitting. In the present model the overfitting does not seem to be too serious. A way
to reduce the number of parameters is to coarse-grain the 20 amino acids into a small number of
categories. This is under study.

We thank Dr. Shan Guan and Prof. Jing-Chu Luo for their kind help with our work.
This work was supported in part by the Special Funds for Major National Basic Research
Projects and the National Natural Science Foundation of China.

Table 1. Predicted counts for each conformation type.

Table 2. Single residue accuracies.

Table 3. A comparison of segment prediction accuracy for short and long helices and sheets. Here a
simple measure of segment overlap is used: a predicted segment is counted as a true positive (TP)
if the predicted segment and an observed segment have at least two residues in common. With this
definition the sensitivity $S_n$ and specificity $S_p$ are calculated.

Figure 1: $Q_3$ statistics for Sets 1 and 2. $Q_3$ is calculated for each sequence, and the size of
the bins is 0.05.

Figure 2: Correlation between whole-sequence prediction accuracy $Q_3$ and the coverage rate, i.e.
the percentage of the total length of a sequence covered by strongly confirmed pairs.